{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word Embeddings\n",
    "- Using embedding layer in Keras\n",
    "- Another way of learning word embeddings is via pre-training word vectors in another network (e.g., word2vec, GloVe, fasttext, etc.)\n",
    "    - If you are interested in pretrained word embeddings, refer to: https://github.com/buomsoo-kim/Word-embedding-with-Python\n",
    "    \n",
    "<br>\n",
    "<img src=\"https://adriancolyer.files.wordpress.com/2016/04/word2vec-distributed-representation.png?w=600\" style=\"width: 600px\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word vectors\n",
    "- Long story short, word embedding is process of converting each word to a fixed dimensional \"(word) vector\"\n",
    "- Dimensionality of embedding space (i.e., vector space) is a hyperparameter; one can set dimensionality as any positive integer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from keras.models import Sequential\n",
    "from keras.layers import *\n",
    "from keras.datasets import reuters\n",
    "from keras.preprocessing import sequence\n",
    "from keras.utils import to_categorical"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Embedding layer\n",
    "- As a result, embedding layer bears 3-D tensors\n",
    "    - Output shape = **(batch_size, input_length, output_dim)**\n",
    "    - input_dim: dimensionality of input space (number of unique tokens of interest)\n",
    "    - output_dim: dimensionality of embedding space\n",
    "    - input_length: length of input sequence (if None, can vary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# when input length is constant\n",
    "model = Sequential()\n",
    "model.add(Embedding(input_dim = 10, output_dim = 5, input_length = 3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(None, 3, 5)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.output_shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# when input length varies\n",
    "model = Sequential()\n",
    "model.add(Embedding(input_dim = 10, output_dim = 5, input_length = None))    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(None, None, 5)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.output_shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using embedding layer in network\n",
    "- Usually, embedding layer are used as first layer in network to model text format data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# parameters to import dataset\n",
    "num_words = 3000\n",
    "maxlen = 50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words = num_words, maxlen = maxlen)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train = sequence.pad_sequences(X_train, maxlen = maxlen, padding = 'post')\n",
    "X_test = sequence.pad_sequences(X_test, maxlen = maxlen, padding = 'post')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_train = to_categorical(y_train, num_classes = 46)\n",
    "y_test = to_categorical(y_test, num_classes = 46)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1595, 50)\n",
      "(399, 50)\n",
      "(1595, 46)\n",
      "(399, 46)\n"
     ]
    }
   ],
   "source": [
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "print(y_train.shape)\n",
    "print(y_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "input_dim = num_words\n",
    "output_dim = 100     # we set dimensionality of embedding space as 100\n",
    "input_length = maxlen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def reuters_model():\n",
    "    model = Sequential()\n",
    "    model.add(Embedding(input_dim = input_dim, output_dim = output_dim, input_length = input_length))\n",
    "    model.add(CuDNNGRU(50, return_sequences = False))\n",
    "    model.add(Dense(100))\n",
    "    model.add(Activation('relu'))\n",
    "    model.add(Dense(46, activation = 'softmax'))\n",
    "    \n",
    "    model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = reuters_model()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "embedding_6 (Embedding)      (None, 50, 100)           300000    \n",
      "_________________________________________________________________\n",
      "cu_dnngru_2 (CuDNNGRU)       (None, 50)                22800     \n",
      "_________________________________________________________________\n",
      "dense_3 (Dense)              (None, 100)               5100      \n",
      "_________________________________________________________________\n",
      "activation_2 (Activation)    (None, 100)               0         \n",
      "_________________________________________________________________\n",
      "dense_4 (Dense)              (None, 46)                4646      \n",
      "=================================================================\n",
      "Total params: 332,546\n",
      "Trainable params: 332,546\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x14ee99cecc0>"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(X_train, y_train, epochs = 100, batch_size = 100, verbose = 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "224/399 [===============>..............] - ETA: 0s"
     ]
    }
   ],
   "source": [
    "result = model.evaluate(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test Accuracy:  0.854636592076\n"
     ]
    }
   ],
   "source": [
    "print('Test Accuracy: ', result[1])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
